ggplot2(4)-散点图

数据集

使用的数据信为gcookhook包中的数据,如下所示:

1
2
3
library(ggplot2)
library(gcookbook)
head(heightweight)

数据如下:

1
2
3
4
5
6
7
8
> head(heightweight)
sex ageYear ageMonth heightIn weightLb
1 f 11.92 143 56.3 85.0
2 f 12.92 155 62.3 105.0
3 f 12.75 153 63.3 108.0
4 f 13.42 161 59.0 92.0
5 f 15.92 191 62.5 112.5
6 f 14.25 171 62.5 112.0

基本散点图

用到的函数是geom_point(),默认为黑色圆点:

1
2
3
head(heightweight[,c("ageYear","heightIn")])
# 只显示ageYear与heightIn这两列,等价于heightweight[,c(2,4)]
ggplot(heightweight,aes(x=ageYear,y=heightIn)) + geom_point()

数据如下所示:

1
2
3
4
5
6
7
8
> head(heightweight[,c("ageYear","heightIn")])
ageYear heightIn
1 11.92 56.3
2 12.92 62.3
3 12.75 63.3
4 13.42 59.0
5 15.92 62.5
6 14.25 62.5

mark

改变形状

在geom_point()参数中加入shape=21,改为圆点:

1
ggplot(heightweight,aes(x=ageYear,y=heightIn)) + geom_point(shape=21)

mark

改变大小

在geom_point()参数中加入size=2,将原来的点扩大2倍:

1
2
ggplot(heightweight,aes(x=ageYear,y=heightIn)) +
geom_point(size=2)

mark

还可以改变点的其它属性:其它属性包括点形(shape),颜色(colour)。

分组

以颜色分组

把根据性别(sex)进行分组,用不同的颜色把不同性别表示出来:

1
2
3
4
5
6
7
8
> head(heightweight[,c("sex","ageYear","heightIn")]) # 列出要绘图的列
sex ageYear heightIn
1 f 11.92 56.3
2 f 12.92 62.3
3 f 12.75 63.3
4 f 13.42 59.0
5 f 15.92 62.5
6 f 14.25 62.5

绘图:

1
2
ggplot(heightweight,aes(x=ageYear,y=heightIn,colour=sex)) +
geom_point()

mark

用形状分组

用不同的形状把不同性别表示出来:

1
2
ggplot(heightweight,aes(x=ageYear,y=heightIn,shape=sex)) +
geom_point()

mark

用形状和颜色两个图形属性同时分组

1
ggplot(heightweight,aes(x=ageYear,y=heightIn,shape=sex,colour=sex)) + geom_point()

mark

用内置的函数进行形状与颜色的设置

1
2
3
4
ggplot(heightweight,aes(x=ageYear,y=heightIn,shape=sex,colour=sex)) +
geom_point()+
scale_shape_manual(values=c(1,2)) +
scale_colour_brewer(palette="Set1")

mark

使用不同于默认设置的点型

使用参数geom_point()中的shape=可以设置不同的点型:

1
ggplot(heightweight,aes(x=ageYear,y=heightIn)) + geom_point(shape=3)

mark

用scale_shape_manual()来修饰点型

1
2
3
ggplot(heightweight,aes(x=ageYear,y=heightIn,shape=sex)) +
geom_point(size=3) +
scale_shape_manual(values=c(1,4))

mark

满足某一条的绘图

下面以heightweight数据集为例,在数据集中增加一个用来标识儿童体重是否超过100磅的列:

1
2
3
4
hw <- heightweight
hw$weightGroup <- cut(hw$weightLb,breaks=c(-Inf, 100, Inf),
labels=c("< 100", ">=100"))
head(hw)

最终数据如下所示:

1
2
3
4
5
6
7
8
9
10
11
> hw <- heightweight
> hw$weightGroup <- cut(hw$weightLb,breaks=c(-Inf, 100, Inf),
+ labels=c("< 100", ">=100"))
> head(hw)
sex ageYear ageMonth heightIn weightLb weightGroup
1 f 11.92 143 56.3 85.0 < 100
2 f 12.92 155 62.3 105.0 >=100
3 f 12.75 153 63.3 108.0 >=100
4 f 13.42 161 59.0 92.0 < 100
5 f 15.92 191 62.5 112.5 >=100
6 f 14.25 171 62.5 112.0 >=100

上一段代码中,将hw中的weightLb的数据分开,区间是负无穷(-Inf)到100,100到正无穷(Inf),将分割的结果给weightGroup,其标签为”< 100”, “>=100”

绘图:

1
2
3
4
5
ggplot(hw,aes(x=ageYear,y=heightIn,shape=sex,fill=weightGroup)) +
geom_point(size=2.5)+
scale_shape_manual(values=c(21,24)) +scale_fill_manual(values=c(NA,"black"),guide=guide_legend(override.aes=list(shape=21)))
# guide_legend(override.aes=list(shape=21))指定图例中的点
# scale_shape_manual(values=c(21,24))设置点的

mark

用连续型变量映射到点的颜色或大小属性上

把体重映射到颜色上:

1
ggplot(heightweight,aes(x=ageYear,y=heightIn,colour=weightLb)) + geom_point()

mark

把体重映射到大小上

1
ggplot(heightweight,aes(x=ageYear,y=heightIn,size=weightLb)) + geom_point()

mark

精确绘图控制

连续型图例

将数值最小值的用黑色表示,数值最高的是白色表示,scale_fill_gradient()用以控制颜色的渐变,这样图例是连续型的:

1
2
3
ggplot(heightweight,aes(x=ageYear,y=heightIn,fill=weightLb)) +
geom_point(shape=21,size=2.5) +
scale_fill_gradient(low="black",high="white")

mark

离散型图例

下面的图例是离散的:

1
2
3
ggplot(heightweight,aes(x=ageYear,y=heightIn,fill=weightLb)) +
geom_point(shape=21,size=2.5) +
scale_fill_gradient(low="black",high="white",breaks=seq(70,170,by=20),guide=guide_legend())

mark

避开重复的点

1
2
3
4
ggplot(heightweight,aes(x=ageYear,y=heightIn,size=weightLb,colour=sex)) +
geom_point(alpha=0.5) +
scale_size_area() + # 把相应的数值与点的面积对应起来
scale_colour_brewer(palette="Set1") # 修改颜色

mark

图形重复问题

先看一个案例:

1
2
sp <- ggplot(diamonds,aes(x=carat,y=price))
sp + geom_point()

mark

上面例子中,点比较多有54000个,有重叠的部分。但可以通过调节点的透明度进行调节:

透明度为0.1:

1
2
3
sp <- ggplot(diamonds,aes(x=carat,y=price))
sp + geom_point()
sp + geom_point(alpha=0.1)

mark

另外一种解决数据点重叠的方法

将数据点进行分箱(bin),并以矩形来表示,同时将数据点的密度映射为矩形 填充色,stat_bin2d()的功能在于将数据进行分箱,默认的是把每个坐标轴上对应的点分30个箱子,即把数据集分为900个箱子如下所示:

1
2
sp <- ggplot(diamonds,aes(x=carat,y=price))
sp + stat_bin2d()

mark

分箱数据的进一步优化:

stat_bin2d(bins=52)可以把每个坐标轴的数据点分为52个箱子,如下所示:

1
2
3
sp <- ggplot(diamonds,aes(x=carat,y=price))
sp + stat_bin2d(bins=52) +
scale_fill_gradient(low="lightblue",high="red",limits=c(0,6000))

mark

分箱数据形状的改变:

默认的分箱形状是正方形,可以换一下:

1
2
3
4
library(hexbin) # 用到里面的函数stat_binhex()
sp + stat_binhex() +
scale_fill_gradient(low="lightblue",high="red",
limits=c(0,8000))

mark

上图的图例是连续型变量,现在改为离散型变量:

1
2
3
4
5
sp + stat_binhex() +
scale_fill_gradient(low="lightblue",high="red",
breaks=c(0,250,500,1000,2000,4000,6000),
limits=c(0,6000))+
guides(fill=guide_legend(reverse=FALSE))

mark

x轴是离散型变量

1
2
sp1 <- ggplot(ChickWeight,aes(x=Time,y=weight))
sp1 + geom_point()

mark

上图所示,x轴上的Time是因子型变量,映射到y轴上的点有重复,现在添加geom_point(position=”jitter”)可以添加一定的扰动,避开重复:

1
sp1 + geom_point(position="jitter")

mark

对振动的程度,也可以进行修饰,width是水平方向的扰动,height是垂直方向上的扰动:

1
sp1 + geom_point(position=position_jitter(width=0.1,height=0))

mark

箱线图

如果x轴是因子型变量,y轴是连续型变量,则用箱线图比较合适:

1
sp1 + geom_boxplot(aes(group=Time))

mark

添加回归模型拟合线

基本绘图

用stat_smooth()添加置信域,默认的是95%,如下所示:

1
2
sp <- ggplot(heightweight,aes(x=ageYear,y=heightIn))
sp + geom_point() + stat_smooth(method=lm)

mark

用stat_smooth(method=lm, levels=0.99)可以对置信域进行修改:

1
sp + geom_point() + stat_smooth(method=lm, levels=0.99)

mark

去掉置信区间

1
sp + geom_point() + stat_smooth(method=lm,se=FALSE)

mark

置信区间的颜色设置

将点改为灰色,如下所示:

1
2
sp + geom_point(colour="grey60")+
stat_smooth(method=lm,se=FALSE,colour="black")

mark

置信区间的其他拟合方式:

1
sp + geom_point(colour="grey60") + stat_smooth()

mark

loess拟合,如下所示:

1
sp + geom_point(colour="grey60") + stat_smooth(method=loess)

mark

Logistics回归

数据集

1
2
3
4
5
library(MASS)
b <- biopsy
b$classen[b$class=="benign"] <- 0
b$classen[b$class=="malignant"] <- 1
head(b,10)

数据如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
> head(b,10)
ID V1 V2 V3 V4 V5 V6 V7 V8 V9 class classen
1 1000025 5 1 1 1 2 1 3 1 1 benign 0
2 1002945 5 4 4 5 7 10 3 2 1 benign 0
3 1015425 3 1 1 1 2 2 3 1 1 benign 0
4 1016277 6 8 8 1 3 4 3 7 1 benign 0
5 1017023 4 1 1 3 2 1 3 1 1 benign 0
6 1017122 8 10 10 8 7 10 9 7 1 malignant 1
7 1018099 1 1 1 1 2 10 3 1 1 benign 0
8 1018561 2 1 2 1 2 1 3 1 1 benign 0
9 1033078 2 1 1 1 2 1 1 1 5 benign 0
10 1033078 4 2 1 1 2 1 2 1 1 benign 0

绘图:

1
2
3
ggplot(b,aes(x=V1,y=classen))+
geom_point(position=position_jitter(width=0.3,height=0.06),alpha=0.4,shape=21,size=1.5)+
stat_smooth(method=glm,method.args = list(family = "binomial"))

mark

有分组变量的的散点图拟合

1
2
3
4
5
library(RColorBrewer)
sps <- ggplot(heightweight,aes(x=ageYear,y=heightIn,colour=sex))+
geom_point() +
scale_colour_brewer(palette="Set1")
sps + geom_smooth()

mark

基于数据集对拟合线进行外推:

此时需要添加fullrange=TRUE参数,如下所示:

1
2

mark

向已有散点图添加拟合曲线

1
2
3
4
5
6
7
8
9
10
11
12
13
model <- lm(heightIn~ageYear+I(ageYear^2),heightweight)
model
# 创建一个包含变量ageYear的列,并对其进行插值
xmin <- min(heightweight$ageYear)
xmax <- max(heightweight$ageYear)
predicted <- data.frame(ageYear=seq(xmin,xmax,length.out = 1000))
# 计算变量heightIn的预测值
predicted$heightIn <- predict(model,predicted)
head(predicted)
#下面代码是将数据点与模型预测值一起绘制在图形上:
sp <- ggplot(heightweight,aes(x=ageYear,y=heightIn))+
geom_point()
sp + geom_line(data=predicted,size=1)

计算结果如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
> model
Call:
lm(formula = heightIn ~ ageYear + I(ageYear^2), data = heightweight)
Coefficients:
(Intercept) ageYear I(ageYear^2)
-10.3136 8.6673 -0.2478
> # 创建一个包含变量ageYear的列,并对其进行插值
> xmin <- min(heightweight$ageYear)
> xmax <- max(heightweight$ageYear)
> predicted <- data.frame(ageYear=seq(xmin,xmax,length.out = 1000))
> # 计算变量heightIn的预测值
> predicted$heightIn <- predict(model,predicted)
> head(predicted)
ageYear heightIn
1 11.58000 56.82624
2 11.58593 56.84358
3 11.59185 56.86091
4 11.59778 56.87822
5 11.60370 56.89551
6 11.60963 56.91279

mark

定义函数绘制拟合曲线

定义predictvals()函数可以简化向散点图添加模型拟合线的过程。使用时,只需要向其传递一个模型作为参数,该函数就会自动查询变量名、预测变量范围、并返回一个包含预测变量和模型预测值的数据框。将该数据框传递给geom_line()函数,即可绘制出模型拟合线:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# 根据模型和变量xvar预测变量yvar
# 仅支持单一预测变量的模型
# xrange:x轴范围,当值为NULL时,等于模型对象中提取的x轴范围;当设定为包含两个数字的向量时,两个数字分别对应于x轴的上下限
# sample: x轴上包含的样本数
# ...:可传递给predict()函数的其他参数:
predictvals <- function(model, xvar,yvar,xrange=NULL,samples=100, ...){
# 如果xrange没有输入,则从模型对象中自动提取x轴范围作为参数
# 提取xrange参数的方法如下所示
if (is.null(xrange)){
if(any(class(model)%in%c("lm","glm")))
xrange <- range(model$model[[xvar]]) else if(any(class(model) %in% "loess"))
xrange <- range(model$x)
}
newdata <- data.frame(x=seq(xrange[1],xrange[2],length.out=samples))
names(newdata) <- xvar
newdata[[yvar]] <- predict(model,newdata=newdata,...)
newdata
}

调用lm()函数与loess()函数可以对数据集建立线性模型和LOESS模型:

1
2
modlinear <- lm(heightIn~ageYear,heightweight)
modloess <- loess(heightIn~ageYear,heightweight)

针对两个模型分别调用predictvals()函数,并将得到的结果(数据框)传递给geom_line():

1
2
3
4
5
lm_predicted <- predictvals(modlinear,"ageYear","heightIn")
loess_predicted <- predictvals(modloess,"ageYear","heightIn")
sp + geom_line(data=lm_predicted,colour="red",size=0.8)+
geom_line(data=loess_predicted,colour="blue",size=0.8)+geom_point()

mark

另外一个案例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
library(MASS)
b <- biopsy
b$classn[b$class=="benign"] <- 0
b$classn[b$class=="malignant"] <- 1
fitlogistic <- glm(classn ~ V1,b,family=binomial)
# 获得预测值
glm_predicted <- predictvals(fitlogistic,"V1","classn",type="response")
ggplot(b,aes(x=V1,y=classn))+
geom_point(position=position_jitter(width=0.3,height=0.08),alpha=0.4,shape=21,size=1.5)+
geom_line(data=glm_predicted,colour="blue",size=1)

mark

添加来自多个模型的拟合线

1
2
3
4
5
6
7
8
9
10
11
make_model <- function(data){
lm(heightIn~ageYear,data)
}
library(plyr)
models<- dlply(heightweight,"sex",.fun=make_model)
# 有了模型对象之后,配合使用ldply()函数和predictvals()函数即可获取两个模型对应的预测值
predvals <- ldply(models,.fun=predictvals,xvar="ageYear",yvar="heightIn")
head(predvals)

查看数据:

1
2
3
4
5
6
7
8
> head(predvals)
sex ageYear heightIn
1 f 11.58000 57.96250
2 f 11.63980 58.03478
3 f 11.69960 58.10707
4 f 11.75939 58.17936
5 f 11.81919 58.25165
6 f 11.87899 58.32394

绘图:

1
2
3
ggplot(heightweight,aes(x=ageYear,y=heightIn,colour=sex))+
geom_point()+
geom_line(data=predvals)

mark

上图中男性的数据点(即m颜色的点)终止于年龄最大点,女性的数据点(即f的颜色点)终止于女性组中年龄的最大点,为了使两组预测线对应的x轴范围与整个数据集的范围相同,可以向其传递一个xrange参数:

1
2
3
predvals <- ldply(models,.fun=predictvals,xvar="ageYear",yvar="heightIn",xrange=range(heightweight$ageYear))
ggplot(heightweight,aes(x=ageYear,y=heightIn,colour=sex))+geom_point()+geom_line(data=predvals)

mark

向散点图中添加模型系数

如果需要向拟合曲线中添加系数,则如下所示:

1
2
model <- lm(heightIn~ageYear,heightweight)
summary(model)

结果如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
> summary(model)
Call:
lm(formula = heightIn ~ ageYear, data = heightweight)
Residuals:
Min 1Q Median 3Q Max
-8.3517 -1.9006 0.1378 1.9071 8.3371
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.4356 1.8281 20.48 <2e-16 ***
ageYear 1.7483 0.1329 13.15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.989 on 234 degrees of freedom
Multiple R-squared: 0.4249, Adjusted R-squared: 0.4225
F-statistic: 172.9 on 1 and 234 DF, p-value: < 2.2e-16

由分析结果可知,模型的R^2=0.4249,则可以用annotate()函数将其添加到曲线上:

1
2
3
pred <- predictvals(model,"ageYear","heightIn")
sp <- ggplot(heightweight,aes(x=ageYear,y=heightIn)) + geom_point()+geom_line(data=pred)
sp+annotate("text",label="r^2=0.42",x=16.5,y=52)

mark

如果不想使用纯文本字符串当注释的话,可以通过设置parse=TRUE调用R的数学表达式来输入公式:

1
sp + annotate("text",label="r^2==0.42",parse=TRUE,x=16.5,y=52)

mark

添加公式

1
2
3
4
5
6
7
8
9
10
11
12
13
eqn <- as.character(as.expression(
substitute(italic(y)==a+b*italic(x)*","~~italic(r)^2~"="~r2,
list(a=format(coef(model)[1],digits=3),
b=format(coef(model)[2],digits=3),
r2=format(summary(model)$r.squared,digits=2)))
))
parse(text=eqn)
## expression(italic(y) == "37.4" + "1.75" * italic(x) * "," ~ ~italic(r)^2 ~
## "=" ~ "0.42")
sp + annotate("text",label=eqn,parse=TRUE,x=Inf,y=-Inf,hjust=1.1,vjust=-0.5)

mark

向散点图添加边际地毯

用到的函数是geom_rug(),以faithful数据集为例说明:

1
2
3
4
5
6
7
8
> head(faithful)
eruptions waiting
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55

绘图:

1
ggplot(faithful,aes(x=eruptions,y=waiting))+geom_point() + geom_rug()

mark

添加扰动,避免重叠

上图中的waiting中的边际地毯线有重叠的(虽然看起来很稀疏,其实都重叠在了一起),因此通过向边际地毯线的位置坐标添加扰动并设定size减少线宽可以减轻边际地毯线的重叠程度。如下所示:

1
ggplot(faithful,aes(x=eruptions,y=waiting))+geom_point()+geom_rug(position="jitter",size=0.2)

mark

向散点图添加标签

用annotate()可以向数据点中添加标签,以countries数据集为例说明,如下所示:

1
2
3
library(gcookbook)
data1 <- subset(countries,Year==2009&healthexp>2000)
head(data1)

数据如下:

1
2
3
4
5
6
7
8
> head(data1)
Name Code Year GDP laborrate healthexp infmortality
254 Andorra AND 2009 NA NA 3089.636 3.1
560 Australia AUS 2009 42130.82 65.2 3867.429 4.2
611 Austria AUT 2009 45555.43 60.4 5037.311 3.6
968 Belgium BEL 2009 43640.20 53.5 5104.019 3.6
1733 Canada CAN 2009 39599.04 67.8 4379.761 5.2
2702 Denmark DNK 2009 55933.35 65.4 6272.729 3.4

绘图:

1
2
sp <- ggplot(data1,aes(x=healthexp,y=infmortality))+geom_point()
sp + annotate("text",x=4350,y=5.4,label="Canada")+annotate("text",x=7400,y=6.8,label="USA")

mark

如果要向每一个数据点添加标签,如下所示:

1
sp + geom_text(aes(label=Name),size=4)

mark

对数据点的标签进行微调

vjust=0时,标签文本的基线会与数据点对齐,设定vjust=1时,标签文本的顶部会与数据点对齐。

1
sp + geom_text(aes(label=Name),vjust=0)

mark

添加y的取值

也可以添加y的映射来进行微调

1
sp + geom_text(aes(y=infmortality+0.1,label=Name),size=4,vjust=0)

mark

左对齐

添加参数hjust=0,如下所示:

1
sp + geom_text(aes(label=Name),size=4,hjust=0)

mark

右对齐

添加参数hjust=1,如下所示:

1
sp + geom_text(aes(label=Name),size=4,hjust=1)

mark

有针对性的增加标签

如果只想给为个别数据点添加标签,但希望R自动设定标签位置的话,可以给数据框增加一个只包含拟使用的标签的新列。一个可行的方案是:首先,将所用数据复制一个副本,并将Name复制为Name1

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
data1$Name1 <- data1$Name
idx <- data1$Name1 %in% c("Canada","Ireland","United Kingdom","United States","New Zealand","Iceland","Japan","Luxembourg","Netherlands","Switzerland")
idx
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
## [12] TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE TRUE TRUE TRUE
data1$Name1[!idx] <- NA
# 把需要的标签留下
head(data1)
## Name Code Year GDP laborrate healthexp infmortality Name1
## 254 Andorra AND 2009 NA NA 3089.636 3.1 <NA>
## 560 Australia AUS 2009 42130.82 65.2 3867.429 4.2 <NA>
## 611 Austria AUT 2009 45555.43 60.4 5037.311 3.6 <NA>
## 968 Belgium BEL 2009 43640.20 53.5 5104.019 3.6 <NA>
## 1733 Canada CAN 2009 39599.04 67.8 4379.761 5.2 Canada
## 2702 Denmark DNK 2009 55933.35 65.4 6272.729 3.4 <NA>
ggplot(data1,aes(x=healthexp,y=infmortality))+
geom_point()+
geom_text(aes(x=healthexp+100,label=Name1),size=4,hjust=0)+
xlim(2000,10000)
## Warning in loop_apply(n, do.ply): Removed 17 rows containing missing
## values (geom_text).

mark

注意%in%的功能,此符号的功能是如下所示:

1
2
3
4
5
6
7
8
9
10
x <- c(1,2,3,4,5,5,5,5,6,6,11,12)
y <- c(1,5,12)
x%in%y
## [1] TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [12] TRUE
x[x%in%y] # 找出x中包含y的元素
## [1] 1 5 5 5 5 12

可以看出,%in%的功能是在x中查找y,如果有则为TRUE,若无则为FALSE。

气泡图

气泡图的每个面积正比于变量值

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
data2 <- subset(countries,Year==2009&Name %in% c("Canada","Ireland","United Kingdom","United States","New Zealand","Iceland","Japan","Luxembourg","Netherlands","Switzerland"))
head(data2)
## Name Code Year GDP laborrate healthexp infmortality
## 1733 Canada CAN 2009 39599.04 67.8 4379.761 5.2
## 4436 Iceland ISL 2009 37972.24 77.5 3130.391 1.7
## 4691 Ireland IRL 2009 49737.93 63.6 4951.845 3.4
## 4946 Japan JPN 2009 39456.44 59.5 3321.466 2.4
## 5864 Luxembourg LUX 2009 106252.24 55.5 8182.855 2.2
## 7088 Netherlands NLD 2009 48068.35 66.1 5163.740 3.8
# 如果把GDP映射给size,则表示把GDP映射给了点的半径,并非面积,如下所示:
p <- ggplot(data2,aes(x=healthexp,y=infmortality,size=GDP))+
geom_point(shape=21,colour="black",fill="cornsilk")
p

mark

如果想要把数据点的值映射给面积,用到的函数是scale_size_area()

1
p + scale_size_area(max_size=15)

mark

其他用法

当x轴与y轴都为分类变量时,气泡图可以用来表示风格点上的变量值,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# 以HairEyeColor数据集为例
HairEyeColor
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
hec <- HairEyeColor[,,"Male"] + HairEyeColor[,,"Female"]
library(reshape2)
hec <- melt(hec,value.name="count")
ggplot(hec,aes(x=Eye,y=Hair))+
geom_point(aes(size=count),shape=21,colour="black",fill="cornsilk")+
scale_size_area(max_size=20,guide=FALSE)+
geom_text(aes(y=as.numeric(Hair)-sqrt(count)/22,label=count),vjust=1,colour="grey60",size=4)

mark

绘制散点矩阵图

1
2
3
4
5
6
7
8
9
10
11
12
13
14
c2009 <- subset(countries,Year == 2009,
select = c(Name,GDP,laborrate,healthexp,infmortality))
head(c2009)
## Name GDP laborrate healthexp infmortality
## 50 Afghanistan NA 59.8 50.88597 103.2
## 101 Albania 3772.605 59.5 264.60406 17.2
## 152 Algeria 4022.199 58.5 267.94653 32.0
## 203 American Samoa NA NA NA NA
## 254 Andorra NA NA 3089.63589 3.1
## 305 Angola 4068.576 81.3 203.80787 99.9
pairs(c2009[,2:5])

mark

上面的绘图没有使用ggplot2,是因为它不能绘制散点图矩阵(至少绘制的效果不佳),上述绘图过得也可以使用自定义的面板函数,现在我们定义一个panel.cor函数来展示变量两两之间的相关系数,以代替默认的散点图。相关系数较大的位置将对应较大的字符,如下所示:

1
2
3
4
5
6
7
8
9
10
panel.cor <- function(x,y,digits=2, prefix="",cex.cor,...){
usr <- par("usr")
on.exit(par("usr"))
par(usr=c(0,1,0,1))
r <- abs(cor(x,y,use="complete.obs"))
txt <- format(c(r,0.123456789),digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5,0.5,txt,cex=cex.cor*(1+r)/2)
}

为了在面板的对角线上展示各个变量的直方图,我们再定义一个panel.hist函数,如下所示:

1
2
3
4
5
6
7
8
9
10
11
panel.hist <- function(x,...){
usr <-par("usr")
on.exit(par(usr))
par(usr=c(usr[1:2],0,1.5))
h<-hist(x,plot=FALSE)
breaks<-h$breaks
nB <- length(breaks)
y<-h$counts
y<-y/max(y)
rect(breaks[-nB],0,breaks[-1],y,col="white",...)
}

上面的函数都源于pairs函数的帮助页面,可以直接复制这个函数的相关代码,定义了这些函数后,我们哦可以高骼它来绘制散点图矩阵。令pairs()函数在面板的上三角执行panel.cor函数,在面板的对角线执行panel.hist函数,在绘图时也添加了一些东西,在面板的下三角执行panel.smooth函数,这个函数将在散点图矩阵的下三角绘制散点图,并添加一个LOWESS平滑曲线,如下所示:

1
2
3
pairs(c2009[,2:5],upper.panel=panel.cor,
diag.panel = panel.hist,
lower.panel = panel.smooth)

mark

也许我们希望用线性模型来代替LOWESS模型,panel.lm函数可以完成该操作,如下所示:

1
2
3
4
5
panel.lm <- function(x,y,col=par("col"),bg=NA,pch=par("pch"),
cex=1,col.smooth="black",...){
points(x,y,pch=pch,col=col,bg=bg,cex=cex)
abline(stats::lm(y~x),col=col.smooth,...)
}

这次系统默认的线条颜色不再是红色,而是黑色,调用函数pairs()时(与函数panel.smooth配合使用),设定col.smooth参数可以对线条颜色进行修改,为了便于辩认数据点,我们在图中使用更小的点,该操作可以通过pch="."来实现,如下所示:

1
2
3
4
pairs(c2009[,2:5],pch=".",
upper.panel=panel.cor,
diag.panel = panel.hist,
lower.panel = panel.lm)

mark

cex参数可以控制图中点的大小,cex参数的默认值是1,其值越大,数据点越大,但如果cex小于0.5,那么图形输出为PDF文件时可能无法很好地渲染。

GGally包的ggpairs函数

使用GGally包的ggpairs函数也能绘制散点矩阵图,如下所示:

1
ggpairs(c2009[,2:5])

mark

参考资料

  1. 常肖楠, 邓一硕, 魏太云. R数据可视化手册[M]. 人民邮电出版社, 2014.